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Abstract 

Learning curves for Gaussian process regression are well under- 
stood when the 'student' model happens to match the 'teacher' 
(true data generation process). I derive approximations to the 
learning curves for the more generic case of mismatched models, 
and find very rich behaviour: For large input space dimensional- 
ity, where the results become exact, there are universal (student- 
independent) plateaux in the learning curve, with transitions in 
between that can exhibit arbitrarily many over-fitting maxima. In 
lower dimensions, plateaux also appear, and the asymptotic de- 
cay of the learning curve becomes strongly student-dependent. All 
predictions are confirmed by simulations. 

1 Introduction 

There has in the last few years been a good deal of excitement about the use 
of Gaussian processes (GPs) as an alternative to feedforward networks [1]. GPs 
make prior assumptions about the problem to be learned very transparent, and 
even though they are non-parametric models, inference — at least in the case of 
regression considered below — is relatively straightforward. One crucial question for 
applications is then how 'fast' GPs learn, i.e., how many training examples are 
needed to achieve a certain level of generalization performance. The typical (as 
opposed to worst case) behaviour is captured in the learning curve, which gives the 
average generalization error e as a function of the number of training examples n. 
Good bounds and approximations for e(n) are now available [1, 2, 3, 4], but these 
are all restricted to the case where the 'student' model exactly matches the true 
'teacher' generating the data. In practice, such a match is unlikely, and so it is 
important to understand how GPs learn if there is some model mismatch. This is 
the aim of this paper. 

In its simplest form, the regression problem is this: We are trying to learn a function 
0* which maps inputs x (real- valued vectors) to (real- valued scalar) outputs 6*(x). 
We are given a set of training data D, consisting of n input-output pairs (x l ,y l ); 
the training outputs y may differ from the 'clean' teacher outputs #*(ar) due to 
corruption by noise. Given a test input x, we are then asked to come up with a 
prediction 9{x), plus error bar, for the corresponding output 0(x). In a Bayesian 
setting, we do this by specifying a prior P{6) over hypothesis functions, and a like- 
lihood P(D\Q) with which each 9 could have generated the training data; from this 



we deduce the posterior distribution P(6\D) oc P(D\9)P(9). For a GP, the prior is 
defined directly over input-output functions 9; this is simpler than for a Bayesian 
feedforward net since no weights are involved which would have to be integrated 
out. Any 9 is uniquely determined by its output values 9{x) for all x from the in- 
put domain, and for a GP, these are assumed to have a joint Gaussian distribution 
(hence the name). If we set the means to zero (as is commonly done), this distri- 
bution is fully specified by the covariance function (9{x)9{x'))g = C(x,x'). The 
latter transparently encodes prior assumptions about the function to be learned. 
Smoothness, for example, is controlled by the behaviour of C(x, x') for x' — > x: The 
Ornstein-Uhlenbeck (OU) covariance function C(x,x') = cxp(— \x — x'\/l) produces 
very rough (non-differcntiable) functions, while functions sampled from the radial 
basis function (RBF) prior with C(x, x') = exp[— \x — x'\ 2 / (2l 2 )] arc infinitely diffcr- 
entiable. Here I is a lengthscale parameter, corresponding directly to the distance 
in input space over which we expect significant variation in the function values. 

There are good reviews on how inference with GPs works [1, 5], so I only give 
a brief summary here. The student assumes that outputs y are generated from 
the 'clean' values of a hypothesis function 9(x) by adding Gaussian noise of x- 
independent variance a 2 . The joint distribution of a set of training outputs {y 1 } 
and the function values 9(x) is then also Gaussian, with covariances given (under 
the student model) by 

(y l y m ) = C(x l ,x m )+a 2 S lm = (K) lm , (y l 9(x)} = C{x\x) = (k(x)) ; 

Here I have defined an n x n matrix K and an x-dependent n-component vector 
k(x). The posterior distribution P(9\D) is then obtained by conditioning on the 
{y 1 }; it is again Gaussian and has mean and variance 

{B{x))e\ D = 0(x\D) = k(a;) T K- 1 y (1) 

((9(x) - 9{x)) 2 ) e]D = Cix^-kixfK-^x) (2) 

From the student's point of view, this solves the inference problem: The best pre- 
diction for 9(x) on the basis of the data D is 9{x\D), with a (squared) error bar 
given by (2). The squared deviation between the prediction and the teacher is 
[9{x\D) — 9„{x)} 2 ; the generalization error (which, as a function of n, defines the 
learning curve) is obtained by averaging this over the posterior distribution of teach- 
ers, all datasets, and the test input x: 

e = {{{[9{x\D)-9»{x)] 2 ) eAD ) D ) x (3) 

Now of course the student does not know the true posterior of the teacher; to 
estimate e, she must assume that it is identical to the student posterior, giving 
from (2) 

e = ((([9(x\D) - 9(x)] 2 ) elD ) D ) x = ((C(x, x) - k(a;) T K- 1 k(a;)) {x , } ) x (4) 

where in the last expression I have replaced the average over D by one over the 
training inputs since the outputs no longer appear. If the student model matches 
the true teacher model, e and e coincide and give the Bayes error, i.e. the best 
achievable (average) generalization performance for the given teacher. 

I assume in what follows that the teacher is also a GP, but with a possibly different 
covariance function C*(x,x') and noise level a 2 . This allows eq. (3) for e to be 
simplified, since by exact analogy with the argument for the student posterior 

(9,(x))g tlD ^K(x) T K: 1 y, (9l(x)) g , lD ^(9,(x))l lD +C,(x,x)-K(x) T K: 1 K(x) 

and thus, abbreviating &(x) — K _1 k(x) — K^ x k»(x), 

e = ((a(x) T yy T a(x) + C*(x,x) - k»(x) T K7 1 k*(x)) D ) x 



Conditional on the training inputs, the training outputs have a Gaussian distribu- 
tion given by the true (teacher) model; hence (yy T }{ y i}\{ x 1 } — K*, giving 

e = ((C r (x,x) - 2K(x) T K- 1 k(x) + k(x) T K- 1 K,K- 1 k(x)) {x i } ) x (5) 
2 Calculating the learning curves 

An exact calculation of the learning curve e(n) is difficult because of the joint av- 
erage in (5) over the training inputs X and the test input x. A more convenient 
starting point is obtained if (using Mercer's theorem) we decompose the covariance 
function into its cigenfunctions <pi(x) and eigenvalues Aj, defined w.r.t. the input 
distribution so that (C(x,x')<f)i(x')} x i — Ai<fii(x) with the corresponding normaliza- 
tion (<)>i(x)<t>j(x)} x = Sij. Then 

oo oo 

C(x,x') — Aj(j>i(x)(j)i(x'), and similarly C*(x,x') = ^ A*(j>i(x)(j>i(x') (6) 

i=l i=l 

For simplicity I assume here that the student and teacher covariance functions have 
the same eigenfunctions (but different eigenvalues). This is not as restrictive as it 
may seem; several examples are given below. The averages over the test input x 
in (5) are now easily carried out: E.g. for the last term we need 

{(k(x)k(xf) lm ) x =J2^Mx l ){<f>i(x)<t>j(x))*<t>j(x m ) = E A^(*')<M* m ) 

ij i 

Introducing the diagonal eigenvalue matrix (A)jj = A^j and the 'design matrix' 
(®)u = this reads (k(x)k(x) T ) x = $A 2 $ T . Similarly, for the second term 

in (5), (k(x)kj (x)) x = $AA»$ T , and (C*(x,x)) x = trA*. This gives, dropping 
the training inputs subscript from the remaining average, 

e = (tr A* — 2 tr SAA^K" 1 + tr $A 2 $ T K" 1 K»K- 1 ) 

In this new representation we also have K = er 2 I + <f>A$ T and similarly for K„; 
for the inverse of K we can use the Woodbury formula to write K^ 1 = c~ 2 [I — 
a _2 $0$ T ], where Q — (A -1 + cr _2 $ T $) _1 . Inserting these results, one finds after 
some algebra that 

e = (j 2 (j -2 [(tr5) - (tr^A-^)] + (tr^A^- 2 ^ (7) 

which for the matched case reduces to the known result for the Bayes error [3] 

6=(trg> (8) 

Eqs. (7,8) are still exact. We now need to tackle the remaining averages over training 
inputs. Two of these are of the form (trC7MC?); if we generalize the definition of 
Q to Q = (A -1 + vl + wM + (T" 2 $ T <I>)" 1 and define g = (tr£?), then they reduce 
to (trC7MCJ) = —dg/dw. (The derivative is taken at v — w = 0; the idea behind 
introducing v will become clear shortly.) So it is sufficient to calculate g. To do 
this, consider how g changes when a new example is added to the training set. One 
has 

0( n + 1) - «») - [e-'w + - «») - - m 

in terms of the vector ip with elements (ip)i — (f>i(x n+ i), using again the Woodbury 
formula. To obtain the change in g we need the average of (9) over both the new 
training input x n +i and all previous ones. This cannot be done exactly, but we can 



approximate by averaging numerator and denominator separately; taking the trace 
then gives g(n + 1) — g(n) = — {tvQ 2 (n))/[a 2 + g(n)]. Now, using our auxiliary 
parameter v, —tr(Q 2 ) = dg/dv; if we also approximate n as continuous, we get 
the simple PDE dg/dn — (dg/dv)/(<r 2 + g) with the initial condition g| n= o = 
tr (A -1 + vl + u>M) _1 . Solving this using the method of characteristics [6] gives a 
self-consistent equation for g, 

(10) 

The Bayes error (8) is e = g\ v = w =o and therefore obeys 

e = trG, G-^A-^-^I (11) 

within our approximation (called 'LC in [3]). To obtain e, we differentiate both 
sides of (10) w.r.t. w, set v = w = and rearrange to give 

(trgMQ) = -dg/dw = (trMG 2 )/[l - (trG 2 )n/(<7 2 + e) 2 ] 

Using this result in (7), with M = A -1 and M = A _1 A*A _1 , we find after some 
further simplifications the final (approximate) result for the learning curve: 

, a 2 trG 2 +n- 1 (^ 2 +e) 2 trA,A- 2 G 2 
6 6 <T 2 trG 2 +n- 1 ( ( 7 2 + e) 2 trA- 1 G 2 ( ' 

which transparently shows how in the matched case e and e become identical. 



g = tr 



+ \v + 



I + wM 



3 Examples 

I now apply the result for the learning curve (11,12) to some exemplary learning 
scenarios. First, consider inputs x which are binary vectors 1 with d components 
x a e { — 1,1}, and assume that the input distribution is uniform. We consider 
covariance functions for student and teacher which depend on the product x ■ x' 
only; this includes the standard choices (e.g. OU and RBF) which depend on the 
Euclidean distance \x — x'\, since \x — x'\ 2 = 2d — 2x ■ x' . All these have the same 
cigenfunctions [8], so our above assumption is satisfied. The eigenfunctions are 
indexed by subsets p of {1, 2 . . . d} and given explicitly by 4> p (x) — Yl aep x a . The 
corresponding eigenvalues depend only on the size s = \p\ of the subsets and are 
therefore (f)-fold degenerate; letting e = (1, 1 ... 1) be the 'all ones' input vector, 
they have the values A s = (C(x, e)(f> p (x)) x (which can easily be evaluated as an 
average over two binomially distributed variables, counting the number of +l's in 
x overall and among the x a with a £ p). With the A s and A* determined, it is 
then a simple matter to evaluate the predicted learning curve (11,12) numerically. 
First, though, focus on the limit of large d, where much more can be said. If we 
write C(x,x') = f(x ■ x'/d), the eigenvalues become, for d — > oo, A s = d^ s f^(0) 
and the contribution to C(x,x) — /(l) from the s-th eigenvalue block is A s = 

(f)A s -» f^(0)/s\, consistent with /(l) = J2T=o f (s) (°)/ sl The A s, because of 
their scaling with d, become infinitely separated for d — > oo. For training sets of 
size n = 0(d L ), we then see from (11) that eigenvalues with s > L contribute as if 
n = 0, since A s 3> n/(a 2 + e); they have effectively not yet been learned. On the 
other hand, eigenvalues with s < L are completely suppressed and have been learnt 
perfectly. We thus have a hierarchical learning scenario, with different scalings of n 

1 This scenario may seem strange, but simplifies the determination of the eigenfunctions 
and eigenvalues; for large d, however, one expects other distributions with continuously 
varying x to give similar results [7]. 



with d — as defined by L — corresponding to different 'learning stages'. Formally, we 
can analyse the stages separately by letting d — > oo at a constant ratio a = n/{f) 
of the number of examples to the number of parameters to be learned at stage L 
(note (£) = 0{d L ) for large d). An independent (replica) calculation shows that 
our approximation for the learning curve actually becomes exact in this limit. The 
resulting a-dependence of e can be determined explicitly: Set Jl = J2s>l ^« ( so 
that /o = /(l)) and similarly for /£. Then for large a, 

e = fl +1 + (fl +1 + oDa- 1 + 0(a- 2 ) (13) 

This implies that, during successive learning stages, (teacher) eigenvalues are learnt 
one by one and their contribution eliminated from the generalization error, giving 

plateaux in the learning curve at e = /* , /| , These plateaux, as well as the 

asymptotic decay (13) towards them, are universal [7], i.e. student-independent. 
The (non-universal) behaviour for smaller a can also be fully characterized: Con- 
sider first the simple case of linear perceptron learning (see e.g. [6]), which corre- 
sponds to both student and teacher having simple dot-product covariance functions 
C(x, x') — C*(x, x') = x-x'/d. In this case there is only a single learning stage (only 
Ai = Af = 1 are nonzero), and e = r(a) decays from r(0) = 1 to r(oo) = 0, with 
an over- fitting maximum around a = 1 if a 2 is sufficiently small compared to a 2 . 
In terms of this function r(a), the learning curve at stage L for general covariance 
functions is then exactly given by e = + \* L r(a) if in the evaluation of r(a) 
the effective noise levels o 2 = (/l+i + <t 2 )/Xl and a 2 = (/£ +1 + f 2 )/A£ are used. 
Note how in a 2 , the contribution /£ +1 from the not-yet-learned eigenvalues acts as 
effective noise, and is normalized by the amount of 'signal' \* L = /j? — available 
at learning stage L. The analogous definition of a 2 implies that, for small a 2 and 
depending on the choice of student covariance function, there can be arbitrarily 
many learning stages L where a 2 <C a 2 , and therefore arbitrarily many over-fitting 
maxima in the resulting learning curves. Fig. 1 (left) demonstrates that this conclu- 
sion holds not just for d —* oo; even for the cases shown, with d = 10, up to three 
over- fitting maxima are apparent. Our theory provides a very good description of 
the numerically simulated learning curves even though, at such small d, the predic- 
tions are still significantly different from those for d — > oo (see Fig. 1 (right)) and 
therefore not guaranteed to be exact. 

In the second example scenario, I consider continuous-valued input vectors, uni- 
formly distributed over the unit interval [0,1]; generalization to d dimensions 
(x € [0, l] d ) is straightforward. For covariance functions which are stationary, i.e. 
dependent on x and x' only through x — x' , and assuming periodic boundary condi- 
tions (see [3] for details), one then again has covariance function-independent eigen- 
functions. They are indexed by integers 2 q, with 4> q (x) — e 2mq ' x ; the corresponding 
eigenvalues are A q = Jdx C(0, x)e~ 2 ' K%q ' x . For the ('periodified') RBF covariance 
function C(x,x') — exp[— (x — x') 2 /(2l 2 )], for example, one has A g oc cxp(— q 2 /2), 
where q = 2irlq. The OU case C(x,x') = cxp(— \x — x'\/l), on the other hand, 
gives A 9 oc (1 + <7 2 ) -1 , thus A q oc q~ 2 for large q. I also consider below covariance 
functions which interpolate in smoothness between the OU and RBF limits: E.g. 
the MB2 (modified Bessel) covariance C(x,x') = e~ a (l + a), with a = \x — x'\/l, 
yields functions which are once differentiable [4]; its eigenvalues A q oc (1 + <7 2 )~ 2 
show a faster asymptotic power law decay, A q oc q~ 4 . To subsume all these cases 
I assume in the following analysis of the general shape of the learning curves that 
A q oc q~ r (and similarly A* oc q~ r *). Here r = 2 for OU, r = 4 for MB2, and (due 
to its faster-than-power law decay of eigenvalues) effectively r = oo for RBF. 

2 Since A 9 = A_ g , one can assume q > if all A 9 for q > are taken as doubly 
degenerate. 
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Figure 1: Left: Learning curves for RBF student and teacher, with uniformly dis- 
tributed, binary input vectors with d = 10 components. Noise levels: Teacher 
a 2 = 1, student cr 2 = 1CP 4 , 10~ 3 , . . . , 1 (top to bottom). Length scales: Teacher 
/* = d 1 / 2 , student I = 2d 1 / 2 . Dashed: numerical simulations, solid: theoretical 
prediction. Right: Learning curves for cr 2 = 1CP 4 and increasing d (top to bottom: 
10, 20, 30, 40, 60, 80, [bold] oo). The x-axis shows a — n/(i), for learning stages 
L = 1,2, 3; the dashed lines are the universal asymptotes (13) for d — > oo. 



From (11,12), it is clear that the n-dependence of the Bayes error e has a strong 
effect on the true generalization error e. From previous work [3], we know that e(n) 
has two regimes: For small n, where e > <r 2 , e is dominated by regions in input 
space which are too far from the training examples to have significant correlation 
with them, and one finds e oc riT^^ 1 ^. For much larger n, learning is essentially 
against noise, and one has a slower decay e oc n~( r ~ 1 ^ r . These power laws can be 
derived from (11) by approximating factors such as [A^ 1 + n/(a 2 + e)] _1 as equal 
to either A q or to 0, depending on whether n/(a 2 + e) < or > A^ 1 . With the same 
technique, one can estimate the behaviour of e from (12). In the small n-regime, one 
finds e ~ c\a 2 + c 2 n~( r *~ 1 \ with prefactors ci, C2 depending on the student. Note 
that the contribution proportional to a 2 is automatically negligible in the matched 
case (since then e = e 3> cr 2 = a 2 for small n); if there is a model mismatch, however, 
and if the small- n regime extends far enough, it will become significant. This is the 
case for small cr 2 ; indeed, for a 2 — > 0, the 'small n'' criterion e » a 2 is satisfied for 
any n. Our theory thus predicts the appearance of plateaux in the learning curves, 
becoming more pronounced as a 2 decreases; Fig. 2(left) confirms this 3 . Numerical 
evaluation also shows that for small cr 2 , over-fitting maxima may occur before the 
plateau is reached, consistent with simulations; see inset in Fig. 2(right). In the 
large n- regime (e <C cr 2 ), our theory predicts that the generalization error decays as 
a power law. If the student assumes a rougher function than the teacher provides 
(r < r„), the asymptotic power law exponent e oc n _ ( r_1 ^ r is determined by the 
student alone. In the converse case, the asymptotic decay is e oc 7i~( r *~ 1 )/ r and 
can be very slow, actually becoming logarithmic for an RBF student (r — > oo). For 
r = r* , the fastest decay for given r* is obtained, as expected from the properties of 
the Bayes error. The simulation data in Fig. 2 are compatible with these predictions 
(though the simulations cover too small a range of n to allow exponents to be 
determined precisely). 



3 If a 2 — exactly, the plateau will extend to n — > oo. With hindsight, this is clear: 
a GP with an infinite number of nonzero eigenvalues has no limit on the number of its 
'degrees of freedom' and can fit perfectly any amount of noisy training data, without ever 
learning the true teacher function. 
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Figure 2: Learning curves for inputs x uniformly distributed over [0,1]. Teacher: 
MB2 covariance function, lengthscale U = 0.1, noise level a 2 — 0.1; student length- 
scale I = 0.1 throughout. Dashed: simulations, solid: theory. Left: OU student 
with a 2 as shown. The predicted plateau appears as a 2 decreases. Right: Stu- 
dents with a 2 = 0.1 and covariance function as shown; for clarity, the RBF and 
OU results have been multiplied by VTO and 10, respectively. Dash-dotted lines 
show the predicted asymptotic power laws for MB2 and OU; the RBF data have a 
persistent upward curvature consistent with the predicted logarithmic decay. Inset: 
RBF student with a 2 = 10~ 3 , showing the occurrence of over-fitting maxima. 



In summary, the above approximate theory makes a number of non-trivial predic- 
tions for GP learning with mismatched models, all borne out by simulations: for 
large input space dimensions, the occurrence of multiple over-fitting maxima; in 
lower dimensions, the generic presence of plateaux in the learning curve if the stu- 
dent assumes too small a noise level a 2 , and strong effects of model mismatch on 
the asymptotic learning curve decay. The behaviour is much richer than for the 
matched case, and could guide the choice of (student) priors in real-world appli- 
cations of GP regression; RBF students, for example, run the risk of very slow 
(logarithmic) decay of the learning curve if the target (teacher) is less smooth than 
assumed. An important issue for future work is to analyse to which extent hyper- 
parameter tuning (e.g. via evidence maximization) can make GP learning robust 
against some forms of model mismatch, e.g. a misspecified functional form of the 
covariance function. 
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